# $\begin{tabular}{ll} Advanced Computer Architectures \\ Theory \end{tabular}$

Christian Rossi

Academic Year 2023-2024

#### Abstract

## The course topics are:

- Review of basic computer architecture: the RISC approach and pipelining, the memory hierarchy.
- Basic performance evaluation metrics of computer architectures.
- Techniques for performance optimization: processor and memory.
- Instruction level parallelism: static and dynamic scheduling; superscalar architectures: principles and problems; VLIW (Very Long Instruction Word) architectures, examples of architecture families.
- Thread-level parallelism.
- Multiprocessors and multicore systems: taxonomy, topologies, communication management, memory management, cache coherency protocols, example of architectures.
- Stream processors and vector processors; Graphic Processors, GP-GPUs, heterogeneous architectures.

# Contents

| 1        | $\operatorname{Intr}$ | oduction                                  |
|----------|-----------------------|-------------------------------------------|
|          | 1.1                   | Architectures' classification             |
|          |                       | 1.1.1 Single instruction single data      |
|          |                       | 1.1.2 Single instruction multiple data    |
|          |                       | 1.1.3 Multiple instructions architectures |
| <b>2</b> | MII                   | $\mathbf{e}_{\mathbf{S}}$                 |
|          | 2.1                   | Characteristics                           |
|          |                       | 2.1.1 MIPS CPU                            |
|          |                       | 2.1.2 Program execution                   |
|          | 2.2                   | MIPS instruction execution                |
|          | Pipelining            |                                           |
|          |                       | 2.3.1 Possible issues                     |
|          |                       | 2.3.2 Data hazards                        |
| 3        | Per                   | ormance evaluation 11                     |
|          | 3.1                   | Introduction                              |
|          | 3.2                   | Speed measures                            |
|          | 3.3                   | Performance measures 1                    |

# Introduction

# 1.1 Architectures' classification

In 1966, Michael Flynn introduced a taxonomy outlining the architecture of calculators. This classification divides architectures into four categories:

- Single Instruction Single Data: utilized by uniprocessor systems.
- Multiple Instruction Single Data: although theoretically possible, this architecture lacks practical configurations.
- Single Instruction Multiple Data: features a straightforward programming model with low overhead and high flexibility, commonly employed in custom integrated circuits.
- Multiple Instruction Multiple Data: known for its scalability and fault tolerance, this architecture is utilized by off-the-shelf microservices.

## 1.1.1 Single instruction single data

The traditional concept of computation involves writing software for serial execution, typically on a single computer with a lone Central Processing Unit (CPU). Tasks are divided into a sequence of discrete instructions executed sequentially, allowing only one instruction to be processed at any given moment. This arrangement is illustrated by the single instruction single data architecture.



Figure 1.1: Single Instruction Single Data (SISD)

In a single instruction single data architecture:

- Single instruction: only one instruction is processed by the CPU in each clock cycle.
- Single data: only one data stream is utilized as input during each clock cycle.

Execution in this setup is deterministic, meaning the outcome is predictable and follows a defined sequence of steps. Single instruction single data architecture architectures represent the most prevalent type of computers.

## 1.1.2 Single instruction multiple data

In the single instruction multiple data architecture, the following characteristics apply:

- Single instruction: all processing units execute the same instruction simultaneously during each clock cycle.
- Multiple data: each processing unit can handle a different data element independently.

This architecture is particularly well-suited for specialized problems with a high level of regularity, such as graphics and image processing.



Figure 1.2: Single Instruction Multiple Data

# 1.1.3 Multiple instructions architectures

Hardware parallelism can be achieved through various methods:

- Instruction-level parallelism: this method harnesses data-level parallelism at different levels. Compiler techniques such as pipelining exploit modest-level parallelism, while speculation techniques operate at medium levels of parallelism.
- Vector architectures and graphic processor units: these architectures leverage data-level parallelism by executing a single instruction across a set of data elements simultaneously.
- Thread-level parallelism: this approach exploits either data-level or task-level parallelism within a closely interconnected hardware model that enables interaction among threads.
- Request-level parallelism: this method capitalizes on parallelism among largely independent tasks specified by either the programmer or the operating system.

Currently, the most common type of parallel computer features:

- Multiple instruction: each processor may execute a different instruction stream.
- Multiple data: each processor may operate with a distinct data stream.

Execution in parallel computing can occur synchronously or asynchronously, and it may be deterministic or non-deterministic.



Figure 1.3: Possible architectures for hardware parallelism

## **MIPS**

# 2.1 Characteristics

MIPS embodies the principles of Reduced Instruction Set Computer (RISC) architecture, focusing on streamlined execution by employing simple instructions within a condensed basic cycle. This design aims to enhance the efficiency of Complex Instruction Set Computer (CISC) CPUs.

As a load-store architecture, MIPS operates such that Arithmetic Logic Unit operands are sourced exclusively from the CPU's general-purpose registers, precluding direct retrieval from memory. Dedicated instructions are thus essential for:

- Loading data from memory into registers.
- Storing data from registers into memory.

A pipeline architecture is a pivotal technique aimed at performance optimization. It capitalizes on the concurrent execution of multiple instructions derived from a sequential execution flow.

Furthermore, the Instruction Set Architecture (ISA) of MIPS encompasses a defined set of operations, instruction formats, supported hardware data types, named storage, addressing modes, and sequencing protocols.



Figure 2.1: MIPS instruction set architecture

## 2.1.1 MIPS CPU

Within a MIPS CPU, the datapath encompasses the necessary components such as storage, functional units (FUs), and interconnects to execute desired operations effectively. In this setup, control points serve as inputs while signals serve as outputs.

The controller, functioning as a state machine, coordinates the activities within the datapath by directing operations based on the desired function and the signals received.



Figure 2.2: MIPS CPU

# 2.1.2 Program execution

At the core, a program is segmented into instructions, with the hardware focusing on individual instructions rather than the entire program. At a lower level, the hardware divides instructions into clock cycles, with lower-level state machines transitioning states with each cycle.

# 2.2 MIPS instruction execution

Each instruction within the MIPS subset can be executed within a maximum of five clock cycles, as outlined below:

- 1. Instruction fetch cycle:
  - Transfer the content of the program counter register to the instruction memory and retrieve the current instruction.
  - Update the program counter to the next sequential address by incrementing it by 4 (since each instruction occupies 4 bytes).
- 2. Instruction decode and register read cycle:
  - Decode the current instruction using fixed-field decoding.
  - Access the register file to read one or two registers as specified by the instruction fields.

• Perform sign-extension of the offset field of the instruction if necessary.

#### 3. Execution cycle:

- For register-register ALU instructions, the ALU performs the specified operation on the operands retrieved from the register file (RF).
- For register-immediate ALU instructions, the ALU performs the specified operation on the first operand retrieved from the RF and the sign-extended immediate operand.
- For memory reference instructions, the ALU computes the effective address by adding the base register and the offset.
- For conditional branches, it compares the two registers read from the RF and calculates the potential branch target address by adding the sign-extended offset to the incremented program counter (PC).

#### 4. Memory access:

- Load instructions entail a read access to the data memory using the effective address.
- Store instructions require a write access to the data memory using the effective address to store the data from the source register read from the RF.
- Conditional branches may update the content of the PC with the branch target address if the conditional test evaluates to true.

## 5. Write-back cycle:

- Load instructions write the data retrieved from memory into the destination register of the RF.
- ALU instructions store the ALU results into the destination register of the RF.



Figure 2.3: MIPS CPU architecture

Below are the durations of each instruction:

| Instruction        | Instruction | Register        | $\mathbf{ALU}$ | Data   | Write | Total   |
|--------------------|-------------|-----------------|----------------|--------|-------|---------|
| type               | memory      | $\mathbf{read}$ | operations     | memory | back  | latency |
| ALU instruction    | 2           | 1               | 2              | 0      | 1     | 6ns     |
| Load               | 2           | 1               | 2              | 2      | 1     | 8ns     |
| Store              | 2           | 1               | 2              | 2      | 0     | 7ns     |
| Conditional branch | 2           | 1               | 2              | 0      | 0     | 5ns     |
| Jump               | 2           | 0               | 0              | 0      | 0     | 2ns     |

The duration of each clock cycle is determined by the critical path established by the load instruction, denoted as T = 8ns (equivalent to a frequency of f = 125MHz). We assume a single-clock cycle execution for each instruction, wherein each module is utilized once within a cycle. Modules utilized more than once within a cycle necessitate duplication for efficiency. Furthermore, to ensure separate functionality, an instruction memory distinct from the data memory is required.

Certain modules must be duplicated, while others are shared across different instruction flows. To facilitate sharing a module between two distinct instructions, a multiplexer is utilized. This device enables multiple inputs to access a module and allows the selection of one input among several based on the configuration of control lines.

In the multi-cycle implementation of CPU, the execution of instructions spans across multiple cycles, with MIPS typically utilizing five cycles. The fundamental cycle is shorter at 2ns, leading to an instruction latency of 10ns. Key aspects of the multi-cycle CPU implementation include:

- Each phase of instruction execution necessitates a clock cycle.
- Modules can be utilized multiple times per instruction across different clock cycles, allowing for potential module sharing.
- Internal registers are required to retain values for subsequent clock cycles. These registers store data to be utilized in future stages of the instruction execution process.

# 2.3 Pipelining

Pipelining is an optimization method aimed at enhancing performance by overlapping the execution of multiple instructions originating from a sequential execution flow. It capitalizes on the inherent parallelism among instructions within a sequential instruction stream.

The fundamental concept involves breaking down the execution of an instruction into distinct phases, also known as pipeline stages. Each stage requires only a portion of the time needed to complete the instruction. These stages are interconnected to form a pipeline: instructions enter at one end, traverse through the stages, and exit from the other end, akin to an assembly line. This facilitates a continuous flow of instruction execution, leading to improved efficiency and throughput.



Figure 2.4: Sequential execution and pipelining execution

In pipelining, each stage of the pipeline corresponds to the time required to advance an instruction by one clock cycle. It's crucial to synchronize the pipeline stages, with the duration of a clock cycle determined by the slower stage of the pipeline, typically 2ns.

The objective is to achieve a balance in the length of each pipeline stage. When stages are perfectly balanced, the ideal speedup resulting from pipelining is equal to the number of pipeline stages. This ensures optimal utilization of the pipeline, enhancing overall performance and efficiency.

In the ideal scenario, we compare a single-cycle unpipelined CPU1 with a clock cycle of 8ns to a pipelined CPU2 with five stages of 2ns each. In this case the latency (total execution time) of each instruction is increased from 8ns to 10ns due to the pipeline overhead. However, the throughput (number of instructions completed in a given time unit) is enhanced by four times: CPU1 completes one instruction every 8ns, while CPU2 completes one instruction every 2ns.

In the ideal scenario, when comparing a multi-cycle unpipelined CPU3 consisting of five cycles of 2ns each to a pipelined CPU2 with five stages of 2ns each. The latency (total execution time) of each instruction remains constant at 10ns. However, the throughput (number of instructions completed in a given time unit) is enhanced by five times: CPU3 completes one instruction every 10ns, while CPU2 completes one instruction every 2ns.

## 2.3.1 Possible issues

A potential concern arises due to the two-stage nature of the register file: read access during the instruction decode stage and write access during the write-back stage. When a read and a write operation target the same register within the same clock cycle, it necessitates the insertion of a stall to prevent issues.

**Definition** (*Optimized pipeline*). An optimized pipeline is achieved when the register file read operation takes place in the second half of the clock cycle, while the register file write operation occurs in the first half of the clock cycle.

Another potential issue is the occurrence of hazards within the pipeline. Hazards arise when there is a dependency between instructions, and the pipelining process causes a change in the order of accessing operands involved in the dependency, thereby preventing the next instruction from executing during its designated clock cycle. Hazards diminish the performance from the ideal speedup achieved by pipelining. Hazards can be categorized into three main types:

• Structural hazards: these occur when different instructions attempt to use the same resource simultaneously. For example, there may be a conflict when both instructions require access to a single memory unit for instructions and data.

- Data hazards: these occur when an instruction tries to use a result before it is ready. For instance, an instruction might depend on the result of a previous instruction that is still in the pipeline.
- Control hazards: these occur when a decision regarding the next instruction to execute is made before the condition for the decision is evaluated. For instance, issues arise during conditional branch execution.

If dependent instructions are executed closely within the pipeline, data hazards become more prevalent.

## 2.3.2 Data hazards

Read after write A data hazard of the "read after write" type occurs when an instruction j attempts to read an operand before instruction i has written to it. This hazard, known as a "dependence" in compiler terminology, arises from a genuine necessity for communication between instructions. The potential solutions for mitigating this hazard include:

- Compilation techniques:
  - Insertion of "nop" (no operation) instructions.
  - Instruction scheduling: the compiler arranges instructions to ensure that dependent instructions are not placed too close together. It attempts to intersperse independent instructions among dependent ones. If independent instructions cannot be found, the compiler inserts "nop" instructions.
- Hardware techniques:
  - Insertion of "bubbles" or stalls in the pipeline.
  - Data Forwarding or Bypassing: this technique involves using temporary results stored in the pipeline registers instead of waiting for the results to be written back to the register file. Multiplexers are added at the inputs of the ALU to fetch inputs from pipeline registers, thus avoiding the need to insert stalls in the pipeline.

Utilizing forwarding allows for resolving this conflict without introducing stalls in most cases. However, for load/use hazards, it is imperative to insert one stall to properly address the issue.

Write after write A data hazard of the "write after write" type arises when instruction j writes operand before instruction i writes it. This situation results in write operations being executed in an incorrect order. Notably, this type of hazard does not occur in the MIPS pipeline since all register write operations take place in the write-back stage. Compiler writers classify this hazard as an output dependence.

Write after read A data hazard of the "write after write" type arises when an instruction j writes operand before instruction i reads it. This scenario leads to the possibility of reading an incorrect value. However, such hazards do not occur in the MIPS pipeline because operand read operations take place in the instruction decode stage, while write operations occur in the write-back stage. Similarly, assuming that register writes in ALU instructions occur in the fourth stage and that two stages are needed to access the data memory, some instructions might read operands too late in the pipeline. Compiler writers classify this hazard as an anti-dependence.

# Performance evaluation

# 3.1 Introduction

Developing software has grown increasingly challenging, reaching a point where manually managing all constraints has become nearly impossible. With computational power at unprecedented levels, processor cores are abundant, yet energy consumption has emerged as a critical limitation. Hence, there's a pressing need for software to be mindful of energy usage and space constraints.

# 3.2 Speed measures

To compare the speed of two computers we can use two metrics:

• Computer system user tries to minimize elapsed time for program execution:

response time : execution time =  $time\_end - time\_start$ 

• Computer center manager tries to maximize the completion rate. In other words it tries to maximize the throughput that is the total amount of work done in a given time.

The two metrics can be compared with the formula:

throughput = 
$$\frac{1}{\text{response time}}$$

This equality holds if there are no overlaps, otherwise we will have a greater throughput.

Usually we consider the frequent case because it is often simpler and can be done faster than the infrequent case.

**Theorem 3.2.1** (Amdahl's law). The performance improvement of a system that can be achieved by optimizing a certain part of the system is limited by the fraction of time during which that part is actually utilized.

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. We have that:

$$Speedup_{overall} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}}}$$

As a result the best we could ever hope to do is:

$$Speedup_{overall} = \frac{1}{1 - Fraction_{enhanced}}$$

## Example:

Consider a new CPU 10x faster. Consider a I/O bound server, so 60% time waiting for I/O. The overall speedup will be:

$$Speedup_{overall} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}}} = \frac{1}{(1 - 0.4) + \frac{0.4}{10}} = 1.56$$

Apparently, its human nature to be attracted by 10X faster, vs. keeping in perspective its just 1.6X faster.

Corollary 3.2.1.1 (Amdahl's law). If an enhancement is only usable for a fraction of a task we can't speed up the task by more than the reciprocal of 1 minus the fraction

# 3.3 Performance measures

The system's performance can be described as follows:

• Response time: this encompasses the latency resulting from completing a task, which includes factors like disk accesses, I/O activity, and operating system overhead. The elapsed time is calculated as the sum of CPU time and I/O wait time:

elapsed time = 
$$CPU$$
 time +  $I/O$  wait

• *CPU time*: this includes the time spent waiting for I/O operations and corresponds to the CPU's processing time. It can be calculated as follows:

CPU time 
$$(P) = \frac{\text{clock cycles needed to execute } P}{\text{clock frequency}}$$

CPU time (P) = clock cycles needed to execute  $P \times \text{clock}$  cycles time

SLIDE 42